This project intends to create a model that helps explain what factors may lead to cancer deaths. To do this, we will look at a dataset which contains data for counties in the 50 United States. For every county, the dataset includes variables such as cancer diagnosis rates, population information, income, age, education levels, family sizes, marriage rates, insurance coverage, and employment rates. We will rely on our intuition and domain experience in the health field to find variables that can help explain the death rate.
This report will walk through the process taken to make an effective and explanatory model. To do this, the dataset must first be analyzed. Then variables will be selected as predictors and used in an initial model. The model will be examined to determine which predictors should stay and which should be removed. The model will be checked to make sure it meets all assumptions and can be used to make statistical inferences. The model will then be finalized and inferences will be conducted, assuming the model assumptions are met.
Plotting the distribution of the Death rate variable
After looking at the distribution we can see most of the counties have death rates around 175 per 100,000. Given the size of the dataset, the variable appears approximately normally distributed. Looking specifically, we see the average death rate below:
## [1] 178.6641
We can see the highest death rate among counties is over 300 per 100,000 and the lowest is around 60 per 100,000. We can also see a summary of the death rate for easier viewing:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 59.7 161.2 178.1 178.7 195.2 362.8
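The exploratory step above can be sketched in base R. The data frame name `df` and column name `TARGET_deathRate` are assumed from the model calls later in the report; a simulated stand-in keeps the snippet self-contained and runnable.

```r
# Simulated stand-in for the report's county dataset (names assumed)
set.seed(42)
df <- data.frame(TARGET_deathRate = rnorm(3047, mean = 178.7, sd = 27.8))

# Histogram of the death-rate distribution
hist(df$TARGET_deathRate, breaks = 50,
     main = "Distribution of death rate per 100,000",
     xlab = "Death rate")

mean(df$TARGET_deathRate)     # average death rate
summary(df$TARGET_deathRate)  # five-number summary plus mean
```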
Initially, I will explore the following variables in order to put them into a model in future steps:
These variables look to be the most promising. After spending nearly a decade working in the insurance industry, with some of that time spent selling health insurance products, these variables should hopefully provide a good explanation of the death rate variable.
Let's look at six of the eight variables plotted against Death Rate.
Looking at all of the plots generated, it looks like three of the six will have some correlation with the death rate. We should examine them all closer to see if this is the case.
There does not appear to be a positive or negative relationship; the data looks to be centered around 200 for the death rate, with values \(\pm\) 100 on the y-axis. There could be some outliers, which we will examine more specifically later.
The statistical summary of the avgAnnCount follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 76.0 171.0 606.3 518.0 38150.0
Because avgAnnCount is so spread out, we should limit what could be outliers.
Looking at the limited plots, there is still no distinguishable trend indicating any correlation between avgAnnCount and death rate.
Looking at the proportion and count of counties by their average annual count of cancer diagnoses. There appears to be a strong positive relationship between incidenceRate and death rate, with possibly a few outliers. Again, we will examine this more thoroughly later.
Statistical summary of the Incidence rate
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 201.3 420.3 453.5 448.3 480.9 1206.9
There is a large spread between the smallest and largest rates reported.
There looks to be a moderate negative relationship between Median Income and Death Rate. Possibly a few outliers, if any.
summary(df$medIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22640 38882 45207 47063 52492 125635
Again there is a large disparity between the highest and lowest median incomes reported by all the counties.
It is difficult to see any relationship between MedianAge and death rate given the values in MedianAge, as 20 or more observations seem to be incorrect (median ages of 350+ years).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22.30 37.70 41.00 45.27 44.00 624.00
It is safe to assume these are entry errors. We will leave them in the dataset for now, but filter them out for the purpose of looking for trends between death rate and MedianAge.
To get a better look at whether there is any trend when excluding the outliers, we will keep only values less than 250:
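The filtering step can be sketched as follows. Column names are assumed from the report; a tiny simulated stand-in keeps the snippet self-contained.

```r
# Simulated stand-in with two obvious entry errors (ages above 250)
df <- data.frame(
  MedianAge        = c(38.5, 41.0, 624.0, 45.2, 384.0),
  TARGET_deathRate = c(181, 175, 190, 168, 202)
)

# Keep only plausible median ages before looking for a trend
df_age <- subset(df, MedianAge < 250)

plot(df_age$MedianAge, df_age$TARGET_deathRate,
     xlab = "Median age", ylab = "Death rate per 100,000")
```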
With a better look at the data between MedianAge and Deathrate, there does not appear to be a relationship.
There does not appear to be much of a trend here.
Looking at the statistical summary of studyPerCap:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 155.40 83.65 9762.31
From the statistical summary, we can see that most of the values are centered around 0. We will need to zoom in on the plot to make sure there is not any trend.
After seeing how many observations appear at 0, it would be helpful to understand how many counties have 0 studies:
The plot shows that almost 2500 counties have few or no studies being done locally.
A small table shows the specific counts and proportions. Similar to the plot, most counties report zero studies.
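A count/proportion table like the one above can be produced with base R. The column name `studyPerCap` is taken from the report; the data here are a simulated stand-in so the snippet runs on its own.

```r
# Simulated stand-in: ~2480 counties with zero studies, the rest positive
set.seed(7)
df <- data.frame(studyPerCap = c(rep(0, 2480), rexp(567, rate = 1/300)))

# Counts and proportions of zero vs. non-zero study rates
counts <- table(df$studyPerCap == 0)
props  <- prop.table(counts)
counts
round(props, 3)
```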
There appears to be a slight positive trend between those who only finished high school and the cancer death rates.
Of the six variables analyzed, three looked to have no correlation with our target variable, death rate. The selection process may have been a bit naive, as variables were chosen based on preconceived ideas rather than analysis. If this were repeated, better variables could certainly be selected.
We will fit a linear model with TARGET_deathRate as the target variable and the variables chosen previously as the predictors.
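The fitting call is presumably of the form below (the formula matches the Call shown in the summary). A simulated stand-in data frame keeps the sketch runnable; the real `df` comes from the report's dataset.

```r
# Simulated stand-in with the report's column names
set.seed(1)
n  <- 300
df <- data.frame(
  avgAnnCount = rexp(n, 1/600), incidenceRate = rnorm(n, 448, 50),
  medIncome = rnorm(n, 47000, 12000), MedianAge = rnorm(n, 41, 5),
  studyPerCap = rexp(n, 1/155), PctHS25_Over = rnorm(n, 35, 7),
  PctPrivateCoverage = rnorm(n, 64, 10), PctPublicCoverage = rnorm(n, 36, 8)
)
df$TARGET_deathRate <- 112 + 0.23 * df$incidenceRate + rnorm(n, 0, 20)

# Fit the initial eight-predictor linear model
model_initial <- lm(TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome +
                      MedianAge + studyPerCap + PctHS25_Over +
                      PctPrivateCoverage + PctPublicCoverage, data = df)
summary(model_initial)
```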
##
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate +
## medIncome + MedianAge + studyPerCap + PctHS25_Over + PctPrivateCoverage +
## PctPublicCoverage, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -117.082 -11.968 0.441 11.655 140.421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.122e+02 6.283e+00 17.855 < 2e-16 ***
## avgAnnCount -8.341e-04 2.804e-04 -2.975 0.002957 **
## incidenceRate 2.346e-01 7.000e-03 33.515 < 2e-16 ***
## medIncome -1.991e-04 5.483e-05 -3.630 0.000288 ***
## MedianAge -7.115e-03 8.179e-03 -0.870 0.384405
## studyPerCap -7.880e-05 7.070e-04 -0.111 0.911259
## PctHS25_Over 9.206e-01 6.427e-02 14.325 < 2e-16 ***
## PctPrivateCoverage -8.782e-01 5.767e-02 -15.227 < 2e-16 ***
## PctPublicCoverage -1.106e-01 8.054e-02 -1.373 0.169935
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.38 on 3038 degrees of freedom
## Multiple R-squared: 0.4623, Adjusted R-squared: 0.4609
## F-statistic: 326.5 on 8 and 3038 DF, p-value: < 2.2e-16
The following variables, using an alpha of .05, are statistically significant:
The initial model has an \(R^2\) value of .4623.
Five of the eight variables selected were statistically significant. The least significant variable was studyPerCap. I assumed this might be among the most significant, as it could indicate counties where research was being done in response to high cancer rates, and thus help the model's predictions.
I also assumed the \(R^2\) value would be higher with 8 variables selected.
In this step we will apply two different automated methods of predictor selection on the dataset:
Both methods start from the previously created model and aim for the same result: a smaller, more efficient model, obtained by removing unnecessary predictors through different procedures.
The fastbw() method uses each predictor's p-value, whether it is statistically significant or not, and removes predictors whose p-value is above a chosen \(\alpha\). In this case we will use \(\alpha = 0.05\).
The stepAIC() method runs through several iterations. During each iteration, it calculates the model's total AIC score as if each predictor had been removed, the point being to find the lowest AIC score possible. Each iteration eliminates one predictor, until removing predictors no longer reduces the AIC score.
After running these two methods we will examine what predictors each method removed and make a decision on what model to move forward with.
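The two procedures can be sketched as below. `stepAIC()` comes from MASS (shipped with standard R installations); `fastbw()` comes from the rms package, so it is shown as a comment. Simulated data stands in for the report's dataset.

```r
library(MASS)  # provides stepAIC()

# Simulated stand-in: only x1 truly drives y
set.seed(2)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
d$y <- 5 + 2 * d$x1 + rnorm(300)

full <- lm(y ~ x1 + x2 + x3, data = d)

# Backward elimination by AIC: drop predictors while AIC keeps decreasing
reduced <- stepAIC(full, direction = "backward", trace = FALSE)

# p-value based backward elimination (alpha = 0.05) would use rms instead:
# fit <- rms::ols(y ~ x1 + x2 + x3, data = d)
# rms::fastbw(fit, rule = "p", sls = 0.05)
```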
Because MedianAge, studyPerCap, and PctPublicCoverage were statistically insignificant, I anticipate the automated selection processes will suggest removing them.
Running the fastbw() method gives us the following output:
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
## studyPerCap 0.01 1 0.9113 0.01 1 0.9113 -1.99 0.462
## MedianAge 0.75 1 0.3858 0.76 2 0.6823 -3.24 0.462
## PctPublicCoverage 2.07 1 0.1507 2.83 3 0.4186 -3.17 0.462
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept 1.056e+02 4.262e+00 24.785 0.0000000
## avgAnnCount -8.499e-04 2.797e-04 -3.039 0.0023734
## incidenceRate 2.334e-01 6.939e-03 33.633 0.0000000
## medIncome -1.703e-04 5.075e-05 -3.356 0.0007901
## PctHS25_Over 9.010e-01 6.254e-02 14.407 0.0000000
## PctPrivateCoverage -8.455e-01 5.185e-02 -16.307 0.0000000
##
## Factors in Final Model
##
## [1] avgAnnCount incidenceRate medIncome PctHS25_Over
## [5] PctPrivateCoverage
The three predictors initially believed to be the ones removed are indicated by fastbw() as having p-values above our chosen \(\alpha\). Estimated p-values for the remaining predictors can be seen above.
Creating a model with the predictors suggested by fastbw():
##
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate +
## medIncome + PctHS25_Over + PctPrivateCoverage, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115.655 -12.173 0.442 11.849 140.284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.056e+02 4.262e+00 24.786 < 2e-16 ***
## avgAnnCount -8.499e-04 2.796e-04 -3.039 0.00239 **
## incidenceRate 2.334e-01 6.939e-03 33.634 < 2e-16 ***
## medIncome -1.703e-04 5.075e-05 -3.356 0.00080 ***
## PctHS25_Over 9.010e-01 6.254e-02 14.407 < 2e-16 ***
## PctPrivateCoverage -8.455e-01 5.185e-02 -16.308 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.38 on 3041 degrees of freedom
## Multiple R-squared: 0.4618, Adjusted R-squared: 0.4609
## F-statistic: 521.8 on 5 and 3041 DF, p-value: < 2.2e-16
The fastbw() model has an \(R^2\) value of .4618, just .0005 smaller than the original model's, with three fewer variables. The originally selected model and the fastbw() model have identical \(R^2_a\) values.
Running the stepAIC() method gives us the following output:
## Start: AIC=18378.78
## TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome +
## MedianAge + studyPerCap + PctHS25_Over + PctPrivateCoverage +
## PctPublicCoverage
##
## Df Sum of Sq RSS AIC
## - studyPerCap 1 5 1261448 18377
## - MedianAge 1 314 1261757 18378
## - PctPublicCoverage 1 782 1262225 18379
## <none> 1261443 18379
## - avgAnnCount 1 3674 1265117 18386
## - medIncome 1 5472 1266915 18390
## - PctHS25_Over 1 85203 1346646 18576
## - PctPrivateCoverage 1 96280 1357723 18601
## - incidenceRate 1 466413 1727856 19335
##
## Step: AIC=18376.79
## TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome +
## MedianAge + PctHS25_Over + PctPrivateCoverage + PctPublicCoverage
##
## Df Sum of Sq RSS AIC
## - MedianAge 1 312 1261760 18376
## - PctPublicCoverage 1 785 1262233 18377
## <none> 1261448 18377
## - avgAnnCount 1 3700 1265148 18384
## - medIncome 1 5473 1266921 18388
## - PctHS25_Over 1 85978 1347426 18576
## - PctPrivateCoverage 1 97325 1358773 18601
## - incidenceRate 1 468296 1729744 19337
##
## Step: AIC=18375.54
## TARGET_deathRate ~ avgAnnCount + incidenceRate + medIncome +
## PctHS25_Over + PctPrivateCoverage + PctPublicCoverage
##
## Df Sum of Sq RSS AIC
## <none> 1261760 18376
## - PctPublicCoverage 1 857 1262618 18376
## - avgAnnCount 1 3663 1265423 18382
## - medIncome 1 5532 1267292 18387
## - PctHS25_Over 1 85879 1347639 18574
## - PctPrivateCoverage 1 97913 1359673 18601
## - incidenceRate 1 468174 1729934 19335
The stepAIC() selection removed studyPerCap in the first iteration and MedianAge in the second. On the third pass, no remaining variable could be removed to lower the AIC score.
The AIC selection process removed two of the three variables I thought would be removed; it did not remove the PctPublicCoverage variable.
Creating a new model with the AIC selection suggestions yields the following output:
##
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate +
## medIncome + PctHS25_Over + PctPrivateCoverage + PctPublicCoverage,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -116.893 -12.011 0.467 11.702 140.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.123e+02 6.280e+00 17.876 < 2e-16 ***
## avgAnnCount -8.315e-04 2.799e-04 -2.971 0.002995 **
## incidenceRate 2.345e-01 6.983e-03 33.586 < 2e-16 ***
## medIncome -1.997e-04 5.470e-05 -3.651 0.000266 ***
## PctHS25_Over 9.207e-01 6.400e-02 14.384 < 2e-16 ***
## PctPrivateCoverage -8.807e-01 5.734e-02 -15.359 < 2e-16 ***
## PctPublicCoverage -1.155e-01 8.033e-02 -1.437 0.150727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.37 on 3040 degrees of freedom
## Multiple R-squared: 0.4621, Adjusted R-squared: 0.4611
## F-statistic: 435.3 on 6 and 3040 DF, p-value: < 2.2e-16
The stepAIC() model has an \(R^2\) value .0002 lower than the original model's, but an \(R^2_a\) value .0002 higher.
The AIC selection did not remove PctPublicCoverage. It had a p-value of .17 in the original model, and with the other two predictors removed it dropped to .15, still 3x higher than the selected \(\alpha\) value. I assumed everything that was statistically insignificant would also lead to a large enough AIC reduction when removed from the model.
Moving forward, I don’t believe leaving the PctPublicCoverage variable in the model adds much. I will use the fastbw() selected model going forward, which has the following variables:
To check the mathematical assumptions of the model we will perform diagnostics on the model chosen in Step Three, in this case the fastbw() selected model. Checking the assumptions relies on the model’s residuals: the differences between the observed (actual) values and the values the model predicts.
To check for heteroscedasticity within the model we will look at the residuals. Specifically, we will look to make sure there is constant variance across the residuals.
First we will conduct the Breusch-Pagan test, which tests the following hypotheses:
\(H_0: \text{homoscedasticity}\)
\(H_a: \text{heteroscedasticity}\)
##
## studentized Breusch-Pagan test
##
## data: model_fastbw_selection
## BP = 70.308, df = 5, p-value = 8.841e-14
Looking just at the bptest() result, we would reject the null and conclude there is not constant variance, meaning the model assumption is not met.
The other test we conduct to check for heteroscedasticity is by plotting the model’s fitted values compared against the model’s residuals.
The fitted values vs. residuals plot looks circular, with no real trends appearing, which contradicts the bptest() and indicates constant variance should be upheld. The reason for this is that the hypothesis test can be overly sensitive with larger datasets, and 3047 observations can certainly do this. Thus, we can reasonably say that the model’s assumption of constant variance is upheld.
To check for independence in the residuals we will also look at the residuals compared to the fitted values.
First we will conduct the Durbin-Watson test, which tests the following hypotheses:
\(H_0: \text{residuals are independent}\)
\(H_a: \text{residuals are not independent}\)
##
## Durbin-Watson test
##
## data: model_fastbw_selection
## DW = 1.6812, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
The dwtest() p-value says we can reject the null and state the residuals are not independent.
We move to the plot.
As it is the same plot as before, we note there is no trend, indicating no correlation between the residuals and the fitted values. Thus we can reasonably say that the model’s assumption of independence is upheld.
To check if the residuals are normally distributed we will conduct the Shapiro-Wilk test, which tests the following hypotheses:
\(H_0: \text{residuals are normal}\)
\(H_a: \text{residuals are NOT normal}\)
##
## Shapiro-Wilk normality test
##
## data: model_fastbw_selection$residuals
## W = 0.98365, p-value < 2.2e-16
The shapiro.test() p-value indicates we have evidence to reject the null and state the residuals are not normal. Again, similar to the statistical tests above, the large dataset is influencing shapiro.test(). We will turn to the Q-Q plot, which plots the model’s residuals against a straight line.
The Q-Q plot does not show any gross deviations from normality in the residuals, so the model assumption should be upheld.
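The diagnostic checks above can be sketched in base R. `bptest()` and `dwtest()` require the lmtest package, so they appear as comments; the base-R checks run on a simulated stand-in for `model_fastbw_selection`.

```r
# Simulated stand-in model
set.seed(3)
d <- data.frame(x = rnorm(500))
d$y <- 1 + 2 * d$x + rnorm(500)
m <- lm(y ~ x, data = d)

# Constant variance / independence: fitted values vs. residuals
plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# lmtest::bptest(m)   # Breusch-Pagan test for heteroscedasticity
# lmtest::dwtest(m)   # Durbin-Watson test for autocorrelation

# Normality: Shapiro-Wilk test and Q-Q plot
shapiro.test(resid(m))
qqnorm(resid(m)); qqline(resid(m))
```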
To conduct this check, we will use the lagged residual plot to make sure there is no trend between the residuals and a lagged version of themselves.
The lagged residual plot does not have any positive or negative trends, indicating there is no serial correlation, so this assumption appears to be met.
We need to check whether any particular observations in the dataset influence the model to the point where the fitted line is drawn toward those points. We will look for outliers using standardized residuals, and for influential points using a method called Cook’s distance.
To look at the standardized residuals, we use the function rstandard(), which calculates the value of each residual in the model. Values with absolute value over 3 are considered outliers. We will put these values into a dataframe for easy manipulation.
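A minimal sketch of this check, using a simulated stand-in model with two planted outliers (the report's model and data frame names are assumed):

```r
# Simulated stand-in with two gross outliers planted at rows 10 and 250
set.seed(4)
d <- data.frame(x = rnorm(500))
d$y <- 1 + 2 * d$x + rnorm(500)
d$y[c(10, 250)] <- d$y[c(10, 250)] + 15
m <- lm(y ~ x, data = d)

std_res  <- rstandard(m)              # internally studentized residuals
outliers <- which(abs(std_res) > 3)   # |value| > 3 flags an outlier
outliers
```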
We will now filter to see if there are any values that would be considered outliers:
## [1] 116 122 254 282 522 650 775 812 912 1048 1059 1221 1331 1366 1497
## [16] 1942 2066 2176 2549 2600 2637 2646 2659 2714 2727
There are 25 total points that would be considered outliers, or .82% of the total dataframe.
Checking the values
## [1] 3.3981 3.5565 4.1182 -5.8087 3.1296 -3.1029 3.5457 -3.3658 -3.0171
## [10] -3.1095 -3.8121 6.8959 -3.0398 5.3300 3.4729 -3.3831 -3.1122 3.1660
## [19] 3.4358 3.2229 3.0294 -4.7967 -3.9332 4.3117 3.3617
Some of these points have sizable standardized residual values, but we should check with Cook’s distance whether any of them have leverage and affect the model.
Using Cook’s distance we have two options to determine if a point has influence. First, if a point’s Cook’s distance is above 1, we would say it is influential.
Using the cooks.distance() function on the model, we will create a vector of the Cook’s distance values. Running that and a line of code to see the highest value’s index and value:
## 282
## 0.2661295
A rule of thumb for influence is a Cook’s distance over 1. The largest of the distances is nowhere near 1.
As another check we can use the F-threshold: the 50th percentile of the F-distribution, with degrees of freedom based on the number of model parameters and the residual degrees of freedom. The value can be seen below.
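The Cook's distance check and the F-threshold can be sketched as follows, assuming the threshold uses the 50th percentile of F with df1 equal to the number of model parameters and df2 equal to the residual degrees of freedom. Simulated data stands in for the report's model.

```r
# Simulated stand-in model with no gross influential points
set.seed(5)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 2 * d$x + rnorm(200)
m <- lm(y ~ x, data = d)

cd <- cooks.distance(m)
which.max(cd)   # index of the most influential point
max(cd)         # its Cook's distance (rule of thumb: worry above 1)

p <- length(coef(m))                     # number of model parameters
f_thresh <- qf(0.5, df1 = p, df2 = nobs(m) - p)
which(cd > f_thresh)                     # points above the F-threshold
```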
## [1] 0.891551
Now we can see if there are any Cook’s distances above the F-threshold:
## named integer(0)
Since there are no values above the 50th percentile threshold of the F-distribution, we can say that while some values could be considered outliers, none appear to have influence/leverage affecting the model.
We will investigate if a model transformation might correct the model if mathematical assumptions of the model were not met in Step Four.
The max value looks to be close to .75, but not close enough to 1 to say there would not be any benefit from a transformation.
## [1] 0.7878788
The boxcox() method has suggested a value of .7879 for \(\lambda\).
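The Box-Cox step can be sketched with `MASS::boxcox()` (shipped with standard R installations). Simulated positive-response data stands in for the report's model.

```r
library(MASS)  # provides boxcox()

# Simulated stand-in with a strictly positive response
set.seed(6)
d <- data.frame(x = rnorm(300))
d$y <- exp(1 + 0.3 * d$x + rnorm(300, 0, 0.2))
m <- lm(y ~ x, data = d)

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc     <- boxcox(m, lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]   # lambda maximizing the log-likelihood
lambda

# Refit with the transformed response
m_bc <- lm(y^lambda ~ x, data = d)
```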
Now we create a model with the Box-Cox \(\lambda\) transformation applied to the response variable, TARGET_deathRate. The summary of the transformed model appears below:
##
## Call:
## lm(formula = (TARGET_deathRate)^lambda ~ avgAnnCount + incidenceRate +
## medIncome + PctHS25_Over + PctPrivateCoverage, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.226 -3.122 0.197 3.160 35.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.985e+01 1.119e+00 35.615 < 2e-16 ***
## avgAnnCount -2.103e-04 7.343e-05 -2.863 0.004221 **
## incidenceRate 6.117e-02 1.822e-03 33.573 < 2e-16 ***
## medIncome -4.568e-05 1.333e-05 -3.428 0.000616 ***
## PctHS25_Over 2.404e-01 1.642e-02 14.641 < 2e-16 ***
## PctPrivateCoverage -2.177e-01 1.361e-02 -15.994 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.351 on 3041 degrees of freedom
## Multiple R-squared: 0.4607, Adjusted R-squared: 0.4599
## F-statistic: 519.7 on 5 and 3041 DF, p-value: < 2.2e-16
The Box-Cox transformed model has slightly lower \(R^2\) and \(R^2_a\) values.
Now we will plot the Box-Cox transformed model’s fitted values against its residuals to see if there was any improvement:
There does not seem to be any change in the diagnostic plot between the original fastbw_model_selection and the transformed fbw_bc model. This suggests the original model already met the mathematical assumptions of a linear model.
We will report the final model and use it to perform inferences.
fastbw() model selection
##
## Call:
## lm(formula = TARGET_deathRate ~ avgAnnCount + incidenceRate +
## medIncome + PctHS25_Over + PctPrivateCoverage, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -115.655 -12.173 0.442 11.849 140.284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.056e+02 4.262e+00 24.786 < 2e-16 ***
## avgAnnCount -8.499e-04 2.796e-04 -3.039 0.00239 **
## incidenceRate 2.334e-01 6.939e-03 33.634 < 2e-16 ***
## medIncome -1.703e-04 5.075e-05 -3.356 0.00080 ***
## PctHS25_Over 9.010e-01 6.254e-02 14.407 < 2e-16 ***
## PctPrivateCoverage -8.455e-01 5.185e-02 -16.308 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.38 on 3041 degrees of freedom
## Multiple R-squared: 0.4618, Adjusted R-squared: 0.4609
## F-statistic: 521.8 on 5 and 3041 DF, p-value: < 2.2e-16
Now we will look at the parameter estimates and p-values for the final model:
Parameter estimates for TARGET_deathRate:

| Predictors | Estimates | p |
|---|---|---|
| (Intercept) | 105.63 | <0.001 |
| avgAnnCount | -0.0008 | 0.002 |
| incidenceRate | 0.23 | <0.001 |
| medIncome | -0.0002 | 0.001 |
| PctHS25 Over | 0.90 | <0.001 |
| PctPrivateCoverage | -0.85 | <0.001 |
| Observations | 3047 | |
| R2 / R2 adjusted | 0.462 / 0.461 | |
## [1] 0.461769
We will compute and report a 95% confidence interval for the slope of the predictor we feel is most important: PctHS25_Over.
## [1] 0.778375 1.023625
We are 95% confident that the slope for PctHS25_Over is between 0.778 and 1.024.
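The interval computation can be sketched with base R's `confint()`. A simulated univariate stand-in replaces the report's final model; the predictor name comes from the report.

```r
# Simulated stand-in relating PctHS25_Over to the death rate
set.seed(8)
d <- data.frame(PctHS25_Over = rnorm(300, 35, 7))
d$TARGET_deathRate <- 105 + 0.9 * d$PctHS25_Over + rnorm(300, 0, 20)
m <- lm(TARGET_deathRate ~ PctHS25_Over, data = d)

# 95% confidence interval for the slope of PctHS25_Over
confint(m, "PctHS25_Over", level = 0.95)
```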
We are computing the interval at the median of each of the model’s predictors: avgAnnCount = 171, incidenceRate = 453.55, medIncome = 45207, PctHS25_Over = 35.3, PctPrivateCoverage = 65.1.
## fit lwr upr
## 1 180.3991 179.6135 181.1846
We are 95% confident that the mean TARGET_deathRate for counties with an avgAnnCount of 171, an incidenceRate of 453.55, a medIncome of 45207, a PctHS25_Over of 35.3, and a PctPrivateCoverage of 65.1 lies between 179.61 and 181.18.
The prediction interval is as follows:
## fit lwr upr
## 1 193.1166 153.1524 233.0809
There is a 95% probability that the target_deathrate of a county with avgAnnCount of 155, incidenceRate of 467.1, medIncome of 39303, PctHS25_Over of 39.8, and PctPrivateCoverage of 59.8 will lie in the range between 153.15 and 233.08.
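Both intervals come from `predict()` on the fitted model: `interval = "confidence"` for the mean response, `interval = "prediction"` for a single county. A simulated univariate stand-in keeps the sketch runnable.

```r
# Simulated stand-in model
set.seed(9)
d <- data.frame(incidenceRate = rnorm(300, 448, 50))
d$TARGET_deathRate <- 112 + 0.23 * d$incidenceRate + rnorm(300, 0, 20)
m <- lm(TARGET_deathRate ~ incidenceRate, data = d)

new_county <- data.frame(incidenceRate = 453.55)

# Confidence interval for the mean death rate at these predictor values
predict(m, newdata = new_county, interval = "confidence", level = 0.95)

# Wider prediction interval for one individual county
predict(m, newdata = new_county, interval = "prediction", level = 0.95)
```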
While the initial variable selection had some issues, the final model was able to explain 46.18% of the variation in the response variable, death rate. The final model equation was: \[\hat{y} = 105.63 - 8.499\times10^{-4}x_{avgAnnCount} + 0.23x_{incidenceRate} - 1.703\times10^{-4}x_{medIncome} + 0.90x_{PctHS25Over} - 0.85x_{PctPrivateCoverage}\]
The most influential variables dealt with education and insurance. More specifically, for every 1 percentage point increase in county residents whose highest level of education is high school, we expect the county’s cancer death rate to increase by 0.90 per 100,000 residents; for every 1 percentage point increase in private insurance coverage, we expect the county’s cancer death rate to decrease by 0.85 per 100,000 residents.
While these are some useful insights, it would be worthwhile to investigate the other variables in the dataset more thoroughly. There were 22 other variables that could lead to a model that better explains the factors behind cancer death rates.